Abstract - The use of domain knowledge is generally found 
to improve query efficiency in content filtering applications. In 
particular, tangible benefits have been achieved when using 
knowledge-based approaches within more specialized fields, 
such as medical free texts or legal documents. However, 
the problem is that sources of domain knowledge are time- 
consuming to build and equally costly to maintain. As a 
potential remedy, recent studies on Wikipedia suggest that this 
large body of socially constructed knowledge can be effectively 
harnessed to provide not only facts but also accurate informa- 
tion about semantic concept-similarities. This paper describes a 
framework for document filtering, where Wikipedia 's concept- 
relatedness information is combined with a domain ontology to 
produce semantic content classifiers. The approach is evaluated 
using the Reuters RCV1 corpus and the TREC-11 filtering task
definitions. In a comparative study, the approach shows robust
performance and appears to outperform content classifiers
based on Support Vector Machines (SVM) and the C4.5 algorithm.

Keywords - Wikipedia; Semantic; Concept-relatedness; SVM;
Ontology; Named-entity recognition 

I. Introduction 

Recently, ontologies have become a broadly accepted 
solution for integrating semantic knowledge into document 
modeling tasks. By using ontologies as a source of back- 
ground knowledge, the IR expert systems have achieved 
increased contextual understanding and ability to do accurate 
conceptual indexing. Yet, these advantages are not gained 
without time-consuming ontology engineering. To reduce the 
costs of managing complicated knowledge models, there is 
an ongoing quest for alternative approaches. One
emerging trend is to consider the use of socially developed
sources of semantic information, such as Wikipedia, to
complement expensive domain ontologies; see Medelyan et
al. [1].

In this paper, we propose a new framework for document
filtering, Wiki-SR, where Wikipedia-based concept-relatedness
information is integrated with a domain ontology
to produce semantic document classifiers. In a certain sense,
this approach can be summarized as a rule-based filtering
model where the filtering criteria are represented by
"semantified" boolean queries. The difference between an
ordinary boolean query and a semantic filtering rule is that,
by using concept-relatedness information, a semantic rule
can also match documents that do not explicitly
feature the original query concepts. The main idea
is thus quite simple: each semantic rule provides a
model for an "implicit expansion" of the original query by
allowing the rule to accept concepts that are
not mentioned in the original query. A concept is accepted
on the condition that the filtered document contains concepts
which are strongly related to the concepts constituting the
rule.

The evaluation of the Wiki-SR framework was carried out
using the Reuters RCV1 corpus. The data set was chosen due to
the relevance judgements and topic definitions supplied by
the assessors of the TREC-11 filtering track. As benchmarks, we
used the Support Vector Machines (SVM) and the decision- 
tree algorithm C4.5, which are well-known for their solid 
performance. Both algorithms were built using several dif- 
ferent feature sets ranging from bag-of-words to Wikipedia- 
and ontology-based document models. As primary perfor- 
mance measures, we used F-score, precision, and recall. The 
overall result appeared very positive for the heuristic Wiki- 
SR model, which outperformed the benchmarks in terms of 
F-score by a fair margin. 

The rest of this paper is organized as follows. Section II
gives a short review of related work and summarizes the
contributions of this paper. Section III provides an overview
of the Wiki-SR framework. The components of the document
model used by the semantic rules are introduced
in Section IV. The notion of concept-relatedness measures
and the definition of the Wiki-SR model are presented in
Section V. An experiment based on the algorithm is given
in Section VI. We conclude in Section VII.



II. Related work and contributions 

Today, Wikipedia is increasingly recognized as a valuable
source of semantic knowledge for various natural language
processing tasks; see Medelyan et al. [1] for a comprehensive
review. As pioneering research in this field, we acknowledge
the work done by Milne et al. [2], [3], Gabrilovich and
Markovitch [4], [5], Medelyan et al. [6], [7], Mihalcea and
Csomai [8], and Strube and Ponzetto [9], who have examined
different ways of using Wikipedia to compute semantic
relatedness between concepts and perform automated
cross-referencing of documents.

However, considering the large potential offered by 
Wikipedia, surprisingly little research has examined its use 
for document profiling, clustering and classification tasks. 
Perhaps the best known papers in which Wikipedia has been
used for information retrieval tasks are the studies on query
expansion by Gregorowicz and Kramer [10] and Milne et
al. [11]. These have been followed by research on how
pseudo-relevance feedback and explicit semantic analysis
can be used to improve queries; see Li et al. [12] and
Egozi et al. [13]. Among the latest studies are also the
papers by Wang et al. [14], [15], where semantic kernels
are derived from Wikipedia to be used in SVM classifiers
and co-clustering methods.

In this paper, our main contribution to the existing literature
is the introduction of Wikipedia-based semantic rules
for document filtering. This technique capitalizes on the 
simplicity of ordinary boolean queries but improves it by 
performing an implicit expansion to take into account the 
actual semantic meanings of the concepts involved in the 
query. However, the idea in our Wiki-SR framework is quite 
different from what has become known as query expansion 
as considered by Milne et al. [11]. Whereas explicit query
expansion is commonly defined as addition of terms and 
phrases to the original query phrase to produce a more 
comprehensive and also more complex expression, we never 
add new terms to the original query. Instead, the synonyms 
and closely related concepts are taken into account implicitly 
through similarity measures in the first evaluation step of 
the semantic rule. Furthermore, although the Wiki-SR rules
in a certain sense build a semantic kernel to model
concept-relatedness, the system is closer to a semantic boolean
query than a kernelised SVM classifier.

The second contribution of this paper concerns
the way of modelling document content. The model described
in Section IV combines three different content models:
a Wikipedia-based model, an ontology-based model, and the
classical bag-of-words model. Here, the Wikipedia-based content model
is further divided into sub-models representing general
concepts and named-entities (NE) by using a Conditional
Random Fields (CRF) classifier. The benefit is that this
separation allows us to take into account the inherent differences
in the narrowness of concept definitions. In addition to
Wikipedia, we also utilise a small business ontology (BTO) 
to account for specialized economic concepts which are not 
equally well captured by Wikipedia. The BTO ontology also 
provides a well-defined hierarchy, which has proven to be 
effective in defining Wiki-SR rules. 

III. Wiki-SR framework

The Wiki-SR framework is an interactive content filtering 
system that combines the relevance statements supplied 
by the user with the concept-relatedness information in 



Wikipedia to produce semantic rules for identifying the 
documents that match the given topic. To summarize the 
steps involved in the filtering process, we split the overview 
of the framework into two parts: (1) the content modeling 
component; and (2) the Wikipedia-based semantic rule com- 
ponent. 

The first component, content modeling, is shown in Figure 1.
Once an incoming document has been preprocessed,
the profile is constructed in three parts: a Wikipedia-content
model (Section IV-A), an Ontology-content model (Section IV-B),
and the classical Bag of Words (BOW) representation.
Together, these constitute the document model
(Section IV-C) used for filtering tasks. Although there is
overlap between the models, they tend to capture different
aspects of the document, which makes them complementary.
The resources used in profiling consist of Wikipedia's
link structure (Wiki DB), the business term ontology BTO
(Ontology DB), and a named-entity recognizer (NER). For
further details on the document model, see Section IV.

Figure 1. Content modeling component

The second component, the Wikipedia-based semantic rule,
is described in Figure 2. The purpose is to represent
the user's information needs as compositions of standard
boolean queries, which are expressed in terms of Wikipedia
and ontology concepts. The resulting rule is referred to
as a Wikipedia-based semantic rule (Wiki-SR rule), which
represents the topic of the user's interest.

In order to build the rules (Wiki-SR builder in Figure 2),
the user is expected to supply a topic statement (see Figure 3)
defining the central concepts, and a small set of
relevant/irrelevant example documents that can be used as
training data for learning the Wiki-SR rule that best
describes the given topic. Each topic statement stands for
a single topic by providing a short textual description of the
concepts which are relevant or irrelevant. For the implementation
of the Wiki-SR builder, see the discussion in Section V-B.

Figure 2. Semantic rule component

Once the semantic rule has been learned, it is given to
an evaluator (Wiki-SR evaluator in Figure 2), which checks
whether the incoming documents match the given rule based
on their profiles. This is the stage where the main benefit 
of constructing the rules in terms of Wikipedia's concepts 
is realized. While checking the potential matches, the eval- 
uator uses Wikipedia's concept-relatedness information to 
judge whether the query's concepts are present in the given 
document. 

Number: R111
Telemarketing practices U.S. 

Description: 

Find documents which reflect telemarketing practices in the U.S. which are 
intrusive or deceptive and any efforts to control or regulate against them.

Narrative: 

Telemarketing practices found to be abusive, intrusive, evasive, deceptive, 
fraudulent, or in any way unwanted by persons contacted are relevant. 
Only such practices in the U.S. are relevant. All efforts to halt these prac- 
tices, including lawsuits, legislation or regulation are also relevant. 

Figure 3. Example topic statement 



IV. Wikipedia-enhanced conceptualization

Research on ontology-based knowledge models has been 
largely motivated by their ability to provide unique defini- 
tions for concepts, their relationships and properties, which 
together create a unified description of a given domain. 
However, the use of ontologies has been limited by the 
large engineering costs, which has stimulated increasing re- 
search on socially or automatically constructed knowledge- 
resources. In this section, we describe a hybrid document 
model, where Wikipedia is used in conjunction with a small 



ontology-based document model and the classical bag-of- 
words representation. 

A. Wikipedia-based content model 

Given a preprocessed document, we start building a document
model by first detecting the Wikipedia-concepts that
best represent its contents. To identify the possible concepts,
we use the machine learning approach described by Milne et
al., which is a refined version of the algorithm suggested
by Medelyan et al. [6]. There, the idea is to train a two-stage
classifier, where the first classifier (disambiguator) decides
which Wikipedia article each candidate term should link to, and
the second classifier (link detector) decides which of those
terms should actually be linked. This
automatic cross-referencing process is commonly known as
topic indexing or wikification:

Definition IV-A.1 (Wikifier): Let $W$ be an instantiation of
Wikipedia. Wikifier is defined as a set-valued mapping
$\mathrm{Wikify} : \mathcal{D} \rightrightarrows W$, from the collection of documents $\mathcal{D}$
to the set of all Wikipedia-concepts. That is, for a given
document $d \in \mathcal{D}$, $\mathrm{Wikify}(d) \subset W$ is a collection of links to
Wikipedia-articles.

In its current form, Wikifier makes no difference between 
general concepts and named-entities. However, as we acknowledge
in Section V, there is a considerable difference
in the specificity of a concept which is a named-entity, e.g. 
"Goldman Sachs", and a general concept, e.g. "Investment 
banking". For instance, to say that a certain document 
discusses Goldman Sachs practically requires that the bank's 
name is explicitly mentioned. But, to say that a document is 
about investment banking is considerably more relaxed; it is 
sufficient to find a collection of investment banking related 
concepts rather than the exact concept name to identify 
the document as relevant. Clearly, this should be taken 
into account when specifying the sensitivity of semantic 
classifiers to different concept types. Therefore, we train a 
named-entity recognizer to complement the Wikifier. 

Definition IV-A.2 (Named-Entity Recognizer): Let $\Sigma$ denote
a language such that $W \subset \Sigma$. The named-entity recognizer
is defined as a set-valued mapping, $\mathrm{NER} : \mathcal{D} \rightrightarrows \Sigma$,
from documents to the collection of all n-grams in language
$\Sigma$ which can be interpreted as named-entities.

Remark IV-A.3: The NER-mapping is implemented as the
Conditional Random Fields (CRF) based classifier proposed
by Finkel et al. [16]. An advantage of this model is that
the system is able to incorporate non-local information, which
allows for long-distance dependency models and enforcing
of label consistency.

Note: The notion Wikipedia-concept is used interchangeably with Wikipedia-
article, because each article in Wikipedia represents a single topic/concept
with a short title. In effect, this amounts to considering Wikipedia as a very
large thesaurus consisting of the terms derived from the titles of all articles.

Note: In the rest of the paper, the notation $\rightrightarrows$ is used to denote a set-valued
mapping.



Finally, having obtained both the set of Wikipedia con- 
cepts and the set of named-entities, we construct the Wiki- 
content model as a combination of the general Wikipedia- 
concepts and Wikipedia named-entities. 

Definition IV-A.4 (Wiki-content model): Let $W$ be an instantiation
of Wikipedia. The Wiki-content model is defined
as the set-valued mapping $\Lambda_W : \mathcal{D} \rightrightarrows W$, which is given
by the union $\Lambda_W(d) = \Lambda_N(d) \cup \Lambda_C(d)$, where

(i) the model for Wikipedia named-entities, $\Lambda_N : \mathcal{D} \rightrightarrows W$,
is given by $\Lambda_N(d) := \mathrm{NER}(d) \cap \mathrm{Wikify}(d)$; and

(ii) the model for general Wikipedia concepts, $\Lambda_C : \mathcal{D} \rightrightarrows W$,
$\Lambda_C(d) := \mathrm{Wikify}(d) \setminus \Lambda_N(d)$, corresponds to the
remainder of concepts identified by Wikifier.
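
The split in Definition IV-A.4 amounts to two set operations. The following is a minimal sketch, assuming that the outputs of the Wikifier and the named-entity recognizer are already available as plain sets of strings; the function name and the sample inputs are hypothetical and not part of the original system.

```python
def wiki_content_model(wikified, named_entities):
    """Split wikified concepts into named-entity and general sub-models.

    wikified       -- concepts returned by a wikifier for one document
    named_entities -- n-grams flagged as named-entities by an NER step
    Both arguments are assumed inputs; no real wikifier or NER is called here.
    """
    wikified = set(wikified)
    named = wikified & set(named_entities)   # NER(d) intersected with Wikify(d)
    general = wikified - named               # remainder of the wikified concepts
    return named, general, named | general   # the union is the Wiki-content model


named, general, model = wiki_content_model(
    {"Goldman Sachs", "Investment banking"},
    {"Goldman Sachs", "Boston"},
)
```

Note that an n-gram recognised only by the NER step ("Boston" above) does not enter the model, because it has no wikified counterpart.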

B. Ontology-based content model 

The ontology model considered in this paper is derived 
from the Business Term Ontology (BTO) proposed by Malo 
and Siitari [17]. The primary purpose of the BTO ontology
is to provide the system with a solid taxonomy of business 
domain concepts, and allow explicit expression of generality 
vs. specificity of concepts through subclassing relation. 

The BTO-ontology model is built using the RDFS extension
proposed by Suchanek et al. [18], where an ontology
O is defined as an injective mapping from a finite set of 
fact-identifiers to fact-triplets. This definition allows a very 
general description of an ontology as a graph, where the 
nodes may be either entities (e.g. concepts such as Option- 
Contract, PutOption, CallOption), relations (e.g. subClassOf, 
hasWikiPage) or fact-identifiers. The basic element in the 
BTO model is thus an entity which may refer to any abstract 
or concrete thing. Throughout, we also assume that the 
entities are discernible and we can tell whether two entities 
are the same. 

Following the notations introduced in the previous section, 
we can now define the BTO-content model as a simple set- 
valued mapping: 

Definition IV-B.1 (BTO-content model): Let $O_{BTO}$ denote
the current instantiation of the BTO-ontology, and let
$C \subset \Sigma$ be the set of ontology concepts expressed in language
$\Sigma$. The ontology-based content profiler is defined as a set-valued
mapping, $\Lambda_O : \mathcal{D} \rightrightarrows C$, from the collection of
documents $\mathcal{D}$ to the set of BTO-concepts. Then the BTO-content
model for document $d \in \mathcal{D}$ is given by the set of
concepts $\Lambda_O(d) \subset C$ produced by the profiler.

C. Document model 

The full document model is then obtained as a combina- 
tion of the Wikipedia and ontology based content models, 
which are augmented with the classical Bag of Words 
(BOW) representation. 

Definition IV-C.1 (Document model): Let $\Lambda_W(d)$ and
$\Lambda_O(d)$ denote the Wikipedia and BTO content models for an
active document $d \in \mathcal{D}$, respectively, and let $\Lambda_\Sigma : \mathcal{D} \rightrightarrows \Sigma$,
$\Lambda_\Sigma(d) \subset \Sigma$, be the bag of words (BOW) representation of
document $d$ for the given language $\Sigma$. The document model
is given by

$$\Lambda(d) := (\Lambda_W(d), \Lambda_O(d), \Lambda_\Sigma(d)),$$

which is interpreted as a sparse vector in $\mathbb{R}^N$, where $N$ is the
joint cardinality of Wikipedia $W$, the BTO-ontology's concept
set $C$, and the language $\Sigma$.

This choice of document model leads to a standard 
vector-space representation, where the presence/absence of a 
Wikipedia-concept, BTO-concept or a word is indicated by 
ones and zeros. Although the above form could easily be
changed to support a weighting scheme, such as Tf-Idf,
we have left this as a question for further research. For
the purpose of the current experiment our primary interest 
is in the benefits obtained from the use of Wikipedia-based 
relatedness measures in detection of relevant documents. 
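
As an illustration of the binary vector-space representation, the sketch below assigns one coordinate to every Wikipedia-concept, BTO-concept and word, and stores a document as a sparse mapping from coordinates to ones. The index layout, the function names and the toy vocabulary are assumptions made for this example only.

```python
def build_index(wiki_concepts, bto_concepts, vocabulary):
    """Assign one coordinate to every (source, item) pair.

    The three sources are kept apart so that a Wikipedia-concept and a word
    with the same surface form still get different coordinates.
    """
    index = {}
    for source, items in (("wiki", wiki_concepts),
                          ("bto", bto_concepts),
                          ("bow", vocabulary)):
        for item in sorted(items):          # sorted for a reproducible layout
            index[(source, item)] = len(index)
    return index


def document_vector(index, wiki_part, bto_part, bow_part):
    """Return the sparse binary vector of a document as {coordinate: 1}."""
    vector = {}
    for source, items in (("wiki", wiki_part),
                          ("bto", bto_part),
                          ("bow", bow_part)):
        for item in items:
            position = index.get((source, item))
            if position is not None:        # items outside the index are skipped
                vector[position] = 1
    return vector


index = build_index({"Fraud", "Telemarketing"}, {"PutOption"}, {"call", "fraud"})
vector = document_vector(index, {"Fraud"}, set(), {"fraud", "unknown"})
```

A weighting scheme such as Tf-Idf would replace the constant 1 by a weight; the presence/absence form is kept here to match the description above.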

V. Semantic filtering with Wiki-SR 

To illustrate the notion of semantic filtering in Wiki-SR,
let us consider the sample topic statement (see Figure 3),
where the goal is to filter documents reporting telemarketing
abuses in the U.S. By reading the statement, one could
come up with a boolean query to represent the topic, e.g.
("U.S." * "telemarketing" * ("fraudulent" + "legislation" +
"regulation")). Clearly, this kind of word-based boolean
expression looks reasonable on the surface and could itself
be used directly, but it is unlikely to yield good recall or
precision. The problem is that there are large amounts of
synonymous or strongly related concepts which could appear 
in the documents instead of the original ones. For example, 
if a document has words {"telesales", "Boston", "crime"}, 
then it is likely to be a match because the words are almost 
synonyms or very strongly related to the query - even though 
none of these words appear in the original query. Thus, 
to enrich the original query with semantic knowledge, we 
need to find a way to measure the relatedness between any 
arbitrary pair of concepts and incorporate this idea into the 
evaluation of the query. For this purpose, we have designed 
the Wiki-SR model, which performs an implicit expansion 
of the query by using Wikipedia's relatedness information. 

To clarify what Wiki-SR rules are, how they are constructed,
and how they are evaluated in practice, the
section is divided into the following parts. In the first part
(Section V-A), we discuss how Wikipedia can be used to
compute semantic relatedness between any pair of concepts.
In particular, we consider how the existing measures can
be adapted for usage in Wiki-SR rules. In the second part
(Section V-B), we present the formal definition of the Wiki-SR
model and discuss how it is constructed using the topic
statement and the set of example documents supplied by the
user. Finally, we describe the steps involved in the evaluation
of the Wiki-SR rules to determine whether a particular
document matches the rule or not.

Note: The boolean operators are written as AND (*) and OR (+).



A. Measuring semantic relatedness 

Although approaches to measuring conceptual relatedness
based on corpora or WordNet have been around for
quite some time, the use of Wikipedia as a source of background
knowledge is a relatively new idea. The first step
in this direction was taken by Strube and Ponzetto [9], who
proposed the WikiRelate technique, which modified existing
measures to work better with Wikipedia. This was soon
followed by the paper of Gabrilovich and Markovitch [5],
who suggested explicit semantic analysis (ESA) to define a
highly accurate similarity measure using the full text of all
Wikipedia articles. The most recent proposal is, however,
the Wikipedia Link-based Measure (WLM) proposed by
Milne et al. [2], [3], where only the internal link structure
of Wikipedia is used to define relatedness. The approach
is known to be computationally very cheap and has still
achieved relatively high correlation with human judgements, which is
why we have adopted it as a basis for the document-concept
similarity measure used in this paper. Below, we describe
how semantic relatedness information of Wikipedia-links
can be incorporated into filtering rules.

Commonly, a semantic relatedness measure is defined
between two concepts. However, from our application's
perspective it is perhaps more interesting to ask: How
strongly is the given concept related to the document at
hand? Or how likely is it for the given concept to appear
in the document? The idea of Milne et al. [2], [3] was to
construct a low-cost measure for semantic relatedness using
only the hyperlink structure of Wikipedia rather than its
category hierarchy or text content. The relatedness measure
is essentially modelled on the Normalized Google Distance
of Cilibrasi and Vitanyi [19]:

Definition V-A.1 (Link-relatedness): Let $w_1, w_2 \in W$ be
Wikipedia-concepts, and let $W_1, W_2 \subset W$ denote the sets
of all articles that link to $w_1$ and $w_2$, respectively. The link
structure based concept-relatedness measure, $\text{link-rel} : W \times W \to [0, 1]$,
is then given by

$$\text{link-rel}(w_1, w_2) = \frac{\log(\max(|W_1|, |W_2|)) - \log(|W_1 \cap W_2|)}{\log(|W|) - \log(\min(|W_1|, |W_2|))}.$$
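
A minimal sketch of a link-based measure of this kind is given below, assuming the normalisation of the Wikipedia Link-based Measure of Milne et al., in which the log-ratio of in-link counts is divided by log |W| minus the log of the smaller in-link set. In that formulation the raw value behaves like a normalised distance (it is 0 for identical in-link sets), so the sketch also returns a clipped similarity 1 - value; this inversion, the function names and the handling of disjoint link sets are assumptions made for illustration, not details taken from the paper.

```python
import math


def link_distance(in_links_1, in_links_2, n_articles):
    """Normalised link-based distance between two articles.

    in_links_1, in_links_2 -- sets of article ids linking to each article
    n_articles             -- total number of articles in the Wikipedia dump
    Returns math.inf when the articles share no in-links (assumed convention).
    """
    a, b = set(in_links_1), set(in_links_2)
    common = a & b
    if not common:
        return math.inf
    numerator = math.log(max(len(a), len(b))) - math.log(len(common))
    denominator = math.log(n_articles) - math.log(min(len(a), len(b)))
    # denominator is 0 only if both link sets cover every article
    return numerator / denominator if denominator > 0 else 0.0


def link_similarity(in_links_1, in_links_2, n_articles):
    """Assumed conversion of the distance to a similarity in [0, 1]."""
    return max(0.0, 1.0 - link_distance(in_links_1, in_links_2, n_articles))
```

Because only in-link sets and a few logarithms are involved, the measure can be computed from a link table alone, which is what makes it cheap compared with text-based measures such as ESA.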

Because link-rel is defined only for uniquely identified
Wikipedia-concepts, we need to extend the definition slightly
to allow relatedness calculation for any pair of n-grams. That
is, many words which are recognized as redirects or anchors
are not counted in the set of Wikipedia-concepts $W \subset \Sigma$.
Therefore, we consider the following extension of link-rel
from $W$ to $\Sigma$.

Definition V-A.2 (Between-terms relatedness): Let
$s_1, s_2 \in \Sigma$ be two terms. The Wikipedia-based term-relatedness
measure, $\mathrm{rel} : \Sigma \times \Sigma \to [0, 1]$, is defined
as

$$\mathrm{rel}(s_1, s_2) = \max\{\text{link-rel}(w_1, w_2) : w_1 \in \mathrm{Senses}(s_1),\ w_2 \in \mathrm{Senses}(s_2)\},$$

where

$$\mathrm{Senses}(s_i) = \{w \in W : s_i \text{ is a redirect, anchor, or title of } w\}.$$

If $s_i$ is uniquely identified, i.e. $s_i \in W$, then $|\mathrm{Senses}(s_i)| = 1$.
Therefore, if the terms are uniquely identified as
Wikipedia-concepts, i.e. $s_1, s_2 \in W$, then $\mathrm{rel}(s_1, s_2) = \text{link-rel}(s_1, s_2)$.
Also, if $\mathrm{Senses}(s_i) = \emptyset$ for some $s_i \in \Sigma$,
then the term is not recognized by Wikipedia, and we have
$\mathrm{rel}(s_i, s_j) = 0$ for every $s_j \in \Sigma$.

Finally, recalling that we wanted a measure between
a document and a concept, we can now use the above
extension to introduce the following simple definition for
a Wikipedia-based relatedness measure:

Definition V-A.3 (Document-term relatedness): Let $s \in \Sigma$
and $d \in \mathcal{D}$. The Wikipedia-based document-term
relatedness measure, $\text{d-rel} : \Sigma \times \mathcal{D} \to [0, 1]$, is given by

$$\text{d-rel}(s, d) = \max\{\mathrm{rel}(s, s') : s' \in \Lambda(d)\},$$

where $\Lambda(d)$ is the document model (Definition IV-C.1). By taking the
maximum over $\Lambda(d)$ we allow $s'$ to be either a Wikipedia-concept,
an ontology concept, or a word in the BOW.

Remark V-A.4: If s G Ao{d), then the relatedness is 
calculated with respect to the Wikipedia-page attached to 
the ontology concept through hasWikiPage-relation, i.e. we 
treat the concept as a Wikipedia-article. If the concept does 
not have Wikipedia-page defined, then it is treated as any 
n-gram. 

The use of maximum, rather than sum-based operator 
such as average, in d-rel is a deliberate choice. Since 
this relatedness measure is intended to be used in filtering 
rules, we do not want to allow sum-operations to mask the 
presence of those concepts in a document which are not 
related to its central story. 
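
The two maxima can be sketched in a few lines. The example below assumes that a sense dictionary (term to candidate articles) and a pairwise link-relatedness function are supplied by the caller; the toy dictionary, the toy scores and all names are hypothetical and serve only to make the sketch executable.

```python
def rel(s1, s2, senses, link_rel):
    """Term relatedness: best link-relatedness over all sense pairs.

    Returns 0.0 when either term has no known sense.
    """
    scores = [link_rel(w1, w2)
              for w1 in senses.get(s1, ())
              for w2 in senses.get(s2, ())]
    return max(scores, default=0.0)


def d_rel(s, doc_terms, senses, link_rel):
    """Document-term relatedness: best term relatedness over the document."""
    return max((rel(s, t, senses, link_rel) for t in doc_terms), default=0.0)


# Hypothetical toy data, not taken from Wikipedia.
SENSES = {
    "telemarketing": {"Telemarketing"},
    "telesales": {"Telemarketing"},            # redirect to the same article
    "boston": {"Boston", "Boston (band)"},     # ambiguous anchor
    "crime": {"Crime"},
}
TOY_SCORES = {
    frozenset({"Telemarketing", "Crime"}): 0.4,
    frozenset({"Telemarketing", "Boston"}): 0.1,
}


def toy_link_rel(w1, w2):
    """Stand-in for a real link-based measure."""
    if w1 == w2:
        return 1.0
    return TOY_SCORES.get(frozenset({w1, w2}), 0.0)
```

With these toy values, a document containing "telesales" among many unrelated terms still obtains the maximal score for the query term "telemarketing", whereas an average over all document terms would be pulled towards zero by the unrelated ones.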

B. Wiki-SR model 

Having introduced the semantic relatedness measure, d-rel,
we can now provide a more detailed explanation of
semantic filtering rules. Following our earlier discussion in
Section III, we decompose the definition of a semantic rule
into two parts (see Figure 2): (1) the rule-builder, which
is responsible for learning the underlying query expression;
and (2) the rule-evaluator, which uses Wikipedia's concept-relatedness
information to perform an implicit expansion of
the query to account for strongly related concepts.

1) Rule-builder: Let $V \subset \Sigma$ be the set of available
ontology and Wikipedia concepts, and let $Q$ denote the
space of all possible boolean query expressions that can
be formulated using the concepts in $V$ and the boolean
operators AND (*), OR (+), and NOT ($\neg$).

Now, assuming that the user has provided a topic statement
$t \in \mathcal{D}$ and a small training collection of relevant/irrelevant
document examples $D_t \subset \mathcal{D}$, the rule builder
is defined as a mapping from the user-inputs to the query
space, i.e.

$$B : (t, D_t) \mapsto q \in Q,$$

where $q$ can contain only those concepts which appear in
the topic statement. That is, if $V_q = \{v_1, \ldots, v_n\} \subset V$ is
the set of concepts included in $q$, then the concepts must be
such that $V_q \subset \Lambda_W(t) \cup \Lambda_O(t)$.

Example V-B.1: Suppose that the user has provided the
topic statement $t$ shown in Figure 3. After wikification of
the document, we have identified a collection of concepts
{UnitedStates, Espionage, Fraud, Legislation, Regulation}.
Given the set of example documents $D_t$ by the user, the rule
builder could produce a rule $B(t, D_t)$ = UnitedStates *
Espionage * (Fraud + Legislation + Regulation).

The builder mapping $B$ is implemented by using the
genetic programming (GP) technique proposed by Malo
et al. [20] that extends the Inductive Query By Example
(IQBE) paradigm of Smith and Smith [21] and Chen et
al. [22]. There, the idea is to use the relevance informa-
tion collected from the user as fitness cases to find the 
query expression that best separates relevant from irrelevant 
document examples. The learning process is driven by the 
evolutionary pressure that guarantees that only the fittest 
individuals among all potential query candidates survive. In 
this paper, we used F-score as the fitness function to find a 
reasonable balance between precision and recall. 
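
The fitness evaluation of a single query candidate can be sketched as follows. The sketch assumes the balanced F1 variant of the F-score (the text does not state the weighting), represents a candidate query as any callable from a document to a boolean, and uses hypothetical toy documents; the genetic operators of the GP algorithm itself are not shown.

```python
def f_score(predicted, actual, beta=1.0):
    """F-score of boolean predictions against boolean relevance labels.

    beta=1.0 gives the balanced F1 measure (assumed here).
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def fitness(query, examples):
    """Fitness of a candidate query on labelled examples.

    query    -- callable mapping a document to True/False
    examples -- list of (document, is_relevant) pairs used as fitness cases
    """
    predicted = [bool(query(doc)) for doc, _ in examples]
    actual = [label for _, label in examples]
    return f_score(predicted, actual)


# Hypothetical fitness cases: documents are sets of detected concepts.
EXAMPLES = [
    ({"Espionage", "UnitedStates"}, True),
    ({"UnitedStates"}, False),
    ({"Espionage"}, True),
    ({"Football"}, False),
]
```

In a GP run, candidates with a higher fitness value would be more likely to be selected for reproduction; here `fitness(lambda d: "Espionage" in d, EXAMPLES)` separates the toy examples perfectly.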

The reason why IQBE-based query builders seem to be
rarely used is perhaps best explained by the tendency of
GP to produce overfitted queries. The risk of overfitting is
high, in particular, when the training sets are small and when
the number of concepts (or literals/terminals in GP) is large.
For these reasons, we have restricted the concept set to the
ones that are detected from the topic statement. Thus, the
version of GP used in this paper is a special case of the more
advanced algorithm proposed by Malo et al. [20], where it
is shown that the query learning can be generalized also to
more realistic cases where predefined topic statements are
not available. For further details on the use of GP-learning,
see Koza [23]. For more information on how
GP can be modified for learning Wikipedia-based queries,
see the forthcoming paper by Malo et al. [20].

2) Rule-evaluator: The rule-evaluator in Wiki-SR provides
a matching subsystem for determining whether a
given document matches the currently active semantic rule.
Now, assuming that the user's topic definition has been
transformed by the rule-builder component to a query $q \in Q$,
the evaluator is specified as a binary-valued mapping,

$$E : Q \times \mathcal{D} \to \{0, 1\},$$

which operates in two steps: (i) concept-evaluation step; and
(ii) expression-evaluation step.

To outline the procedure, let us suppose that the query expression
has the form $q = v_1 r v_2 r \cdots r v_k$, where $\{v_1, \ldots, v_k\} \subset V$
and each $r$ could be replaced by any of the boolean
operators. Then the evaluation steps can be defined as
follows:

(i) Concept-evaluation step: The purpose of the concept- 
evaluation step is to determine whether the query concepts 
are present in the active document d - either directly or 
indirectly. The task is accomplished by a specific concept- 
evaluator function that is applied in turn to every concept in 
the query. 

The concept-evaluator's decision rule is carried out in 
three parts: (a) First, it checks whether the query
concepts are directly featured in the document. (b) If no
match is found, the rule then searches for related concepts 
or words, which would strongly predict the presence of the 
given concepts. If the d-rel based sensitivity threshold is 
exceeded, then the rule decides that the given concept is 
present in the document, (c) Finally, if these two steps fail, 
it is concluded that the given concept is not present. 

To formalize this idea, we define the concept evaluator as the
function $\delta : V \times \mathcal{D} \to \{0, 1\}$,

$$\delta(v, d) = \begin{cases} 1 & \text{if } v \in \Lambda_W(d) \cup \Lambda_O(d), \\ 1 & \text{if } v \in \mathrm{Rel}(d), \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathrm{Rel}(d) = \{v \in V : \text{d-rel}(v, d) > c_{rel}(v)\}$, and $c_{rel} > 0$
is a threshold function controlling the acceptance sensitivity
by relatedness criteria. The threshold for d-rel depends on
the type of concept, i.e. whether it is a named-entity, general
Wikipedia-article, or BTO-concept,

$$c_{rel}(v) = \begin{cases} c_1 & \text{if } v \text{ is a named-entity, i.e. } v \in \Lambda_N(d), \\ c_2 & \text{otherwise.} \end{cases}$$

Each sensitivity threshold is chosen based on training data. 
The purpose of the distinction between named-entities and 
general concepts is to allow stricter thresholds for named- 
entities which have by default narrower definitions than 
general concepts. 
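
The three-part decision rule can be written down directly. The sketch below assumes that the d-rel value of the concept has already been computed and that the caller knows whether the concept is a named-entity; the threshold values 0.8 and 0.5 in the usage lines are hypothetical placeholders, since the actual thresholds are chosen from training data.

```python
def concept_evaluator(concept, doc_concepts, d_rel_value, is_named_entity, c1, c2):
    """Return 1 if the concept is judged present in the document, else 0.

    concept         -- query concept to test
    doc_concepts    -- Wikipedia and ontology concepts found in the document
    d_rel_value     -- precomputed document-term relatedness of the concept
    is_named_entity -- selects the stricter threshold c1 instead of c2
    """
    if concept in doc_concepts:                  # (a) direct match
        return 1
    threshold = c1 if is_named_entity else c2    # concept-type dependent sensitivity
    if d_rel_value > threshold:                  # (b) accepted through relatedness
        return 1
    return 0                                     # (c) concept not present


doc = {"TradeSecret", "China", "Lawyer"}
espionage = concept_evaluator("Espionage", doc, 0.7, False, c1=0.8, c2=0.5)
goldman = concept_evaluator("Goldman Sachs", doc, 0.7, True, c1=0.8, c2=0.5)
```

With the same relatedness value of 0.7, the general concept is accepted while the named-entity is rejected, which is the intended effect of keeping the two thresholds separate.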

Example V-B.2: Let us continue Example V-B.1,
where the query rule produced by the builder was $q$ =
UnitedStates * Espionage * (Fraud + Legislation +
Regulation). Now, suppose that a new document $d$
features concepts {TradeSecret, China, Lawyer}. Then
the concept-evaluator's task is to find $\delta(v, d)$ for every
concept $v$ in $q$. In this particular case, we would
obtain that $\delta(\text{Espionage}, d) = 1$ and $\delta(\text{Legislation}, d) = 1$,
because Espionage, Legislation $\in \mathrm{Rel}(d)$. However,
$\delta(\text{UnitedStates}, d) = 0$ because none of the concepts in
$d$ is strongly related to the U.S.

(ii) Expression-evaluation step: In the Wiki-SR framework, it
is the concept-evaluator function $\delta$ which does most of the
work. Once the variables in the query $q$ have been evaluated,
the expression-evaluation step amounts to replacing the original
query variables $\{v_1, \ldots, v_k\}$ with the values given by
the concept-evaluator $\{\delta(v_1, d), \ldots, \delta(v_k, d)\}$, i.e. we obtain
that

$$E(q, d) := \delta(v_1, d)\, r\, \delta(v_2, d)\, r \cdots r\, \delta(v_k, d).$$

The value of this final expression can then be used
to decide whether the given document is relevant or not.
For instance, in the case of the previous example we would
find that the particular document is not relevant, because
$\delta(\text{UnitedStates}, d)$ was 0.
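
The expression-evaluation step can be illustrated with a small recursive evaluator. The nested-tuple encoding of queries and the function names are assumptions made for this sketch; the concept values for Espionage, Legislation and UnitedStates follow Example V-B.2, while the values for Fraud and Regulation are not stated in the example and are set to 0 here as an assumption.

```python
def evaluate(query, delta):
    """Evaluate a boolean query once its concepts have been scored.

    query -- a concept name, or a tuple ("and" | "or" | "not", operand, ...)
    delta -- callable returning 0 or 1 for a concept name
    """
    if isinstance(query, str):
        return bool(delta(query))
    operator, *operands = query
    if operator == "and":
        return all(evaluate(operand, delta) for operand in operands)
    if operator == "or":
        return any(evaluate(operand, delta) for operand in operands)
    if operator == "not":
        return not evaluate(operands[0], delta)
    raise ValueError(f"unknown operator: {operator!r}")


# UnitedStates * Espionage * (Fraud + Legislation + Regulation)
QUERY = ("and", "UnitedStates", "Espionage",
         ("or", "Fraud", "Legislation", "Regulation"))

# Concept-evaluator outputs for the document of Example V-B.2.
VALUES = {"UnitedStates": 0, "Espionage": 1, "Legislation": 1,
          "Fraud": 0, "Regulation": 0}   # last two are assumed values

is_relevant = evaluate(QUERY, lambda concept: VALUES.get(concept, 0))
```

The document is rejected because the conjunction fails on UnitedStates; if that single value were 1, the same expression would accept the document.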



VI. Experiment 



A. Data 



The evaluation of the Wiki-SR framework is based on the Reuters
RCV1 corpus using TREC-11 topic statements. The corpus
contains about 800 000 news stories from years 1996-1997.
Following TREC-11 instructions, the data set is partitioned
into a training set (items dated between 1996-08-20 and
1996-09-30) and a test set (remainder of the collection). The training
and test sets are further divided into 100 topic-specific
subsets, which are augmented with the relevance judgements
made by the assessors of TREC-11. In this paper, only the
initial training data is used, while the relevance statements 
available for adaptive learning are not utilized. 

B. System 

The system used in the experiment was implemented in Java
on top of the GATE [24] platform, which provides tools for
standard document preprocessing tasks. The Wikipedia-based
content model was built using the WikipediaMiner published by
Milne et al. [25], which was suitably modified and integrated
into our framework. The named-entity recognition task was
carried out using a Conditional Random Field (CRF) classifier
proposed by Finkel et al. [16]. For other classification tasks,
we used Weka [26] through the Java-ML [27] package. The manual
ontology editing was done in the Protégé³ framework. All
automated ontology engineering tasks were done using Sesame⁴
with a MySQL repository.

C. Results 

In the experiment, the performance of the Wiki-SR framework
is compared against Support Vector Machines (SVM) and
the decision-tree classifier C4.5. The primary performance
measures used for comparison are F-score, precision, recall,
and accuracy.
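
For reference, the four measures can be computed from per-topic confusion counts as in the minimal sketch below. The function name and the counts are hypothetical, and the text does not state which F-score variant was used, so the weight `beta` is left as a parameter with the balanced F1 as an assumed default.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, F-score and accuracy from confusion counts.

    tp/fp/fn/tn -- true/false positives and negatives for one topic
    beta        -- F-score weight (assumption: 1.0, i.e. balanced F1)
    """
    total = tp + fp + fn + tn
    # Undefined ratios (no predicted or no actual positives) are set to 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Hypothetical counts for a single topic.
print(classification_metrics(tp=30, fp=20, fn=10, tn=40))

# A classifier that rejects every document on an unbalanced topic.
print(classification_metrics(tp=0, fp=0, fn=7, tn=93))
```

In the second call the classifier finds none of the 7 relevant documents, yet its accuracy is 0.93. This is why accuracy alone says little on unbalanced topics and why F-score and recall carry most of the comparison.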

The comparison was carried out as follows. The system
started with the given collection of 100 topics and a set of
training documents for each topic, where the documents had
been pre-assigned as relevant or irrelevant by the assessors of
TREC-11. The task was then to train the classifiers using
the information in the training-samples and the initial topic-
statements. Here, each topic was considered separately and
no cross-topic learning was allowed.

The construction of Wiki-SR classifiers was implemented 
in two steps. First, in order to obtain the boolean query 
statements expressed in terms of Wikipedia and ontology 
concepts, the topic-statements were profiled and the obtained 
concepts were used to build query rules. Then, the sensitiv- 
ity thresholds required by the relatedness-based acceptance 
criteria were optimised using the training samples. 
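
The text does not specify how the thresholds were optimised. One simple possibility, shown here purely as a sketch, is an exhaustive grid search that keeps the pair of thresholds (one for named-entities, one for general concepts) with the best F1 on the training sample. All names, the relatedness scores in [0, 1], the `>=` acceptance test and the training data are hypothetical.

```python
from itertools import product


def matches(expr, scores, named_entities, t_entity, t_general):
    """Evaluate a Boolean query against one document's relatedness scores."""
    if isinstance(expr, str):
        # Named-entities and general concepts use separate thresholds.
        threshold = t_entity if expr in named_entities else t_general
        return scores.get(expr, 0.0) >= threshold
    op, *args = expr
    hits = [matches(a, scores, named_entities, t_entity, t_general)
            for a in args]
    return all(hits) if op == "and" else any(hits)


def f1(predictions, labels):
    """Balanced F-score of Boolean predictions against Boolean labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fp = sum(p and not y for p, y in zip(predictions, labels))
    fn = sum(not p and y for p, y in zip(predictions, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def tune_thresholds(query, named_entities, training, grid):
    """Return (best F1, entity threshold, general threshold) over the grid."""
    best = (-1.0, None, None)
    labels = [label for _, label in training]
    for t_entity, t_general in product(grid, repeat=2):
        preds = [matches(query, scores, named_entities, t_entity, t_general)
                 for scores, _ in training]
        score = f1(preds, labels)
        if score > best[0]:  # ties keep the first pair found
            best = (score, t_entity, t_general)
    return best


# Hypothetical training sample: (relatedness scores, relevance label).
TRAINING = [
    ({"UnitedStates": 0.90, "Espionage": 0.6}, True),
    ({"UnitedStates": 0.40, "Espionage": 0.7}, False),
    ({"UnitedStates": 0.80, "Espionage": 0.2}, False),
    ({"UnitedStates": 0.95, "Espionage": 0.5}, True),
]
QUERY = ("and", "UnitedStates", "Espionage")

print(tune_thresholds(QUERY, {"UnitedStates"}, TRAINING, [0.3, 0.5, 0.7]))
```

On this toy sample the search settles on a stricter threshold for the named-entity than for the general concept, in line with the motivation given for keeping the two thresholds separate.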

In a similar fashion, the benchmark classifiers were
optimised using only the training data. However, none of
the benchmark classifiers used the information in the original
topic-statements. As feature sets, three different document
models were considered: a bag-of-words profile (Tokens), a
Wikipedia profile (Wiki), and a profile where Wikipedia
concepts are augmented with ontology concepts (BTO Wiki).

Table I reports performance measures for the models Wiki-SR,
LibSVM and C4.5. In order to take the varying quality of the
different topics into account, the results are further divided
into three subtables based on the ratio of negative to
positive examples in the training sample, i.e.
$t_r = \#\text{negative examples} / \#\text{positive examples}$.
Panel A gives results for all topics, Panel B for topics with
a low ratio ($t_r < 5$), and Panel C for topics with a high
ratio ($t_r > 5$).
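
The split into panels can be illustrated with a short sketch. The function names and the example counts are hypothetical; the cutoff of 5 is taken from the running text, and assigning a topic whose ratio equals the cutoff (or which has no positive examples) to the high-ratio panel is an assumption, since the text does not cover those cases.

```python
import math


def imbalance_ratio(n_negative, n_positive):
    """Ratio of negative to positive training examples for one topic."""
    return n_negative / n_positive if n_positive else math.inf


def assign_panel(n_negative, n_positive, cutoff=5.0):
    """Return "B" for low-ratio topics and "C" for high-ratio topics."""
    return "B" if imbalance_ratio(n_negative, n_positive) < cutoff else "C"


# Hypothetical topics: (negative examples, positive examples).
print(assign_panel(40, 10))   # ratio 4.0
print(assign_panel(120, 10))  # ratio 12.0
```

Every topic also contributes to Panel A, which pools all 100 topics.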

Table I
Model comparison

Panel A: Results for all topics

Model     Profile    F-Score  Accuracy  Precision  Recall
C4.5      Tokens     0.28     0.8       0.28       0.37
          Wiki       0.31     0.84      0.35       0.37
          BTO Wiki   0.30     0.83      0.32       0.36
LibSVM    Tokens     0.22     0.89      0.58       0.21
          Wiki       0.25     0.89      0.56       0.23
          BTO Wiki   0.27     0.88      0.54       0.25
Wiki-SR   -          0.44     0.84      0.47       0.54

Panel B: Results for topics with tr < .5

Model     Profile    F-Score  Accuracy  Precision  Recall
C4.5      Tokens     0.39     0.74      0.35       0.5
          Wiki       0.41     0.78      0.40       0.48
          BTO Wiki   0.41     0.77      0.40       0.49
LibSVM    Tokens     0.40     0.85      0.56       0.39
          Wiki       0.42     0.85      0.58       0.40
          BTO Wiki   0.44     0.84      0.55       0.43
Wiki-SR   -          0.53     0.84      0.56       0.58

Panel C: Results for topics with tr > .5

Model     Profile    F-Score  Accuracy  Precision  Recall
C4.5      Tokens     0.18     0.87      0.20       0.25
          Wiki       0.22     0.89      0.29       0.26
          BTO Wiki   0.20     0.89      0.25       0.23
LibSVM    Tokens     0.05     0.93      0.60       0.03
          Wiki       0.08     0.93      0.54       0.06
          BTO Wiki   0.09     0.92      0.53       0.07
Wiki-SR   -          0.36     0.86      0.36       0.5

¹ Reuters Corpus, Volume 1, http://about.reuters.com/researchandstandards/corpus
² TREC 2002 Filtering Track Collections, http://trec.nist.gov/data
³ http://protege.stanford.edu
⁴ http://www.openrdf.org



First of all, a general comparison of the models suggests
that the Wiki-SR heuristic achieves consistently better results
than the benchmark algorithms in terms of F-score. See
Figure 4 for F-score and recall boxplots computed using
all 100 topics. When searching for causes, it appears that
the performance differences are largely explained by the
recall levels. Whereas the differences in accuracy and
precision are relatively small, as observed from Table I,
the Wiki-SR heuristic achieves considerably better results in
terms of recall (0.54 over all topics, against at most 0.37
for the benchmarks). Interestingly, when considering a division
of topics based on the proportion of irrelevant and relevant
documents in the training sample, we find that the heuristic
has fared considerably better than its benchmarks on highly
unbalanced topics; see Panels B and C in Table I. This
observation may be explained by the fact that a rule-based
model such as Wiki-SR is less sensitive to the quality of the
training sample than, for example, SVM classifiers. It is also
known that an imbalance between positive and negative examples
in the training sample can have an adverse effect on
traditional classifiers.

Figure 4. Results for 100 TREC-11 topics: (a) F-score, (b) Recall.

Finally, to investigate the effect of the document model given
to the benchmark classifiers, both SVM and C4.5 models
were built using three alternative profiles with different
levels of concept information. A quick comparison reveals
that wikification slightly improves the results for all topics
as measured by F-score, especially when the more unbalanced
topics are considered. However, the case of BTO concepts
shows mixed evidence.

VII. Conclusions 

In this paper, we have presented a new document filtering
framework, Wiki-SR, where Wikipedia's extensive domain
knowledge is utilised to produce effective semantic
classification rules. An empirical experiment based on the
Reuters RCV1 corpus and TREC-11 topic statements revealed that
the use of semantic concept-relatedness information along
with a suitable document model has a considerable combined
effect on classification performance. The results suggest
that although there are some benefits already in the
use of a concept-based representation of a document's
contents, the profile is not truly effective unless there is
also knowledge about the relationships between different
concepts. For this purpose, the use of Wikipedia as a source
of domain knowledge is ideal due to its incredibly dense
link-structure and broad scope.

In future work, we will investigate how machine learning
can be used to complement our Wikipedia-based approach
to determining document-concept relatedness. In particular,
we assume that the techniques used in multi-task learning
could prove to be very beneficial in this respect. As another
direction for further development, we are examining how
the Boolean rules used in Wiki-SR can be better extracted
automatically from text in natural language form. In
particular, we are interested in techniques where the
rule structures can be learned without the use of explicit
topic definitions. One such direction is examined in
the forthcoming paper by Malo et al. [20], where a modified
GP-algorithm is developed to learn Wikipedia-based queries
using only sample documents supplied by the user.
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